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An important problem in econometrics and marketing is to in¬ 
fer the causal impact that a designed market intervention has ex¬ 
erted on an outcome metric over time. This paper proposes to in¬ 
fer causal impact on the basis of a diffusion-regression state-space 
model that predicts the counterfactual market response in a syn¬ 
thetic control that would have occurred had no intervention taken 
place. In contrast to classical difference-in-differences schemes, state- 
space models make it possible to (i) infer the temporal evolution of 
attributable impact, (ii) incorporate empirical priors on the param¬ 
eters in a fully Bayesian treatment, and (iii) flexibly accommodate 
multiple sources of variation, including local trends, seasonality and 
the time-varying influence of contemporaneous covariates. Using a 
Markov chain Monte Carlo algorithm for posterior inference, we il¬ 
lustrate the statistical properties of our approach on simulated data. 

We then demonstrate its practical utility by estimating the causal 
effect of an online advertising campaign on search-related site vis¬ 
its. We discuss the strengths and limitations of state-space models 
in enabling causal attribution in those settings where a randomised 
experiment is unavailable. The Causalimpact R package provides an 
implementation of our approach. 


1. Introduction. This article proposes an approach to inferring the causal 
impact of a market intervention, such as a new product launch or the onset of 
an advertising campaign. Our method generalises the widely used difference- 
in-differences approach to the time-series setting by explicitly modelling the 
counterfactual of a time series observed both before and after the interven¬ 
tion. It improves on existing methods in two respects: it provides a fully 
Bayesian time-series estimate for the effect; and it uses model averaging to 
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construct the most appropriate synthetic control for modelling the counter- 
factual. The Causalimpact R package provides an implementation of our 
approach (http://google.github.io/CausalImpact/) . 

Inferring the impact of market interventions is an important and timely 
problem. Partly because of recent interest in big data, many firms have 
begun to understand that a competitive advantage can be had by system¬ 
atically using impact measures to inform strategic decision making. An ex¬ 
ample is the use of “A/B experiments” to identify the most effective market 
treatments for the purpose of allocating resources [Danaher and Rust (1996), 
Seggie, Cavusgil and Phelan (2007), Leeflang et al. (2009), Stewart (2009)]. 

Here, we focus on measuring the impact of a discrete marketing event, 
such as the release of a new product, the introduction of a new feature, or 
the beginning or end of an advertising campaign, with the aim of measuring 
the event’s impact on a response metric of interest (e.g., sales). The causal 
impact of a treatment is the difference between the observed value of the 
response and the (unobserved) value that would have been obtained under 
the alternative treatment, that is, the effect of treatment on the treated [An- 
tonakis et al. (2010), Claveau (2012), Cox and Wermuth (2001), Heckman 
and Vytlacil (2007), Hitchcock (2004), Hoover (2012), Kleinberg and Hripc- 
sak (2011), Morgan and Winship (2007), Rubin (1974, 2008)]. In the present 
setting the response variable is a time series, so the causal effect of interest 
is the difference between the observed series and the series that would have 
been observed had the intervention not taken place. 

A powerful approach to constructing the counterfactual is based on the 
idea of combining a set of candidate predictor variables into a single “syn¬ 
thetic control” [Abadie and Gardeazabal (2003), Abadie, Diamond and Hain- 
mueller (2010)]. Broadly speaking, there are three sources of information 
available for constructing an adequate synthetic control. The first is the 
time-series behaviour of the response itself, prior to the intervention. The 
second is the behaviour of other time series that were predictive of the tar¬ 
get series prior to the intervention. Such control series can be based, for 
example, on the same product in a different region that did not receive the 
intervention or on a metric that reflects activity in the industry as a whole. 
In practice, there are often many such series available, and the challenge is to 
pick the relevant subset to use as contemporaneous controls. This selection 
is done on the pre-treatment portion of potential controls; but their value 
for predicting the counterfactual lies in their post-treatment behaviour. As 
long as the control series received no intervention themselves, it is often rea¬ 
sonable to assume the relationship between the treatment and the control 
series that existed prior to the intervention to continue afterwards. Thus, 
a plausible estimate of the counterfactual time series can be computed up 
to the point in time where the relationship between treatment and controls 
can no longer be assumed to be stationary, for example, because one of the 
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controls received treatment itself. In a Bayesian framework, a third source of 
information for inferring the counterfactual is the available prior knowledge 
about the model parameters, as elicited, for example, by previous studies. 

We combine the three preceding sources of information using a state- 
space time-series model, where one component of state is a linear regression 
on the contemporaneous predictors. The framework of our model allows us 
to choose from among a large set of potential controls by placing a spike- 
and-slab prior on the set of regression coefficients and by allowing the model 
to average over the set of controls [George and McCulloch (1997)]. We then 
compute the posterior distribution of the counterfactual time series given 
the value of the target series in the pre-intervention period, along with the 
values of the controls in the post-intervention period. Subtracting the pre¬ 
dicted from the observed response during the post-intervention period gives 
a semiparametric Bayesian posterior distribution for the causal effect (Fig¬ 
ure 1). 

Related work. As with other domains, causal inference in marketing re¬ 
quires subtlety. Marketing data are often observational and rarely follow the 
ideal of a randomised design. They typically exhibit a low signal-to-noise 
ratio. They are subject to multiple seasonal variations, and they are often 
confounded by the effects of unobserved variables and their interactions [for 
recent examples, see Seggie, Cavusgil and Phelan (2007), Stewart (2009), 
Leeflang et al. (2009), Takada and Bass (1998), Chan et al. (2010), Lewis 
and Reiley (2011), Lewis, Rao and Reiley (2011), Vaver and Koehler (2011, 
2012 )]. 

Rigorous causal inferences can be obtained through randomised experi¬ 
ments, which are often implemented in the form of geo experiments 
[Vaver and Koehler (2011, 2012)]. Many market interventions, however, fail 
to satisfy the requirements of such approaches. For instance, advertising 
campaigns are frequently launched across multiple channels, online and of¬ 
fline, which precludes measurement of individual exposure. Campaigns are 
often targeted at an entire country, and one country only, which prohibits the 
use of geographic controls within that country. Likewise, a campaign might 
be launched in several countries but at different points in time. Thus, while 
a large control group may be available, the treatment group often consists 
of no more than one region or a few regions with considerable heterogeneity 
among them. 

A standard approach to causal inference in such settings is based on a 
linear model of the observed outcomes in the treatment and control group 
before and after the intervention. One can then estimate the difference be¬ 
tween (i) the pre-post difference in the treatment group and (ii) the pre-post 
difference in the control group. The assumption underlying such difference- 
in-differences (DD) designs is that the level of the control group provides an 
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Fig. 1. Inferring causal impact through counterfactual predictions, (a) Simulated trajec¬ 
tory of a treated market (Y) with an intervention beginning in January 2014- Two other 
markets (Xi, X 2 ) were not subject to the intervention and allow us to construct a synthetic 
control [cf. Abadie and Gardeazabal (2003), Abadie, Diamond and Hainmueller (2010)]. 
Inverting the state-space model described in the main text yields a prediction of what would 
have happened in Y had the intervention not taken place (posterior predictive expectation 
of the counterfactual with pointwise 95% posterior probability intervals), (b) The differ¬ 
ence between observed data and counterfactual predictions is the inferred causal impact of 
the intervention. Here, predictions accurately reflect the true (Gamma-shaped) impact. A 
key characteristic of the inferred impact series is the progressive widening of the posterior 
intervals (shaded area). This effect emerges naturally from the model structure and agrees 
with the intuition that predictions should become increasingly uncertain as we look further 
and further into the (retrospective) future, (c) Another way of visualizing posterior infer¬ 
ences is by means of a cumulative impact plot. It shows, for each day, the summed effect 
up to that day. Here, the 95% credible interval of the cumulative impact crosses the zero¬ 
line about five months after the intervention, at which point we would no longer declare a 
significant overall effect. 


adequate proxy for the level that would have been observed in the treatment 
group in the absence of treatment [see Lester (1946), Campbell, Stanley and 
Gage (1963), Ashenfelter and Card (1985), Card and Krueger (1993), An- 
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grist and Krueger (1999), Athey and Imbens (2002), Abadie (2005), Meyer 
(1995), Shadish, Cook and Campbell (2002), Donald and Lang (2007), An- 
grist and Pischke (2008), Robinson, McNulty and Krasno (2009), Antonakis 
et al. (2010)]. 

DD designs have been limited in three ways. First, DD is traditionally 
based on a static regression model that assumes i.i.d. data despite the fact 
that the design has a temporal component. When fit to serially correlated 
data, static models yield overoptimistic inferences with too narrow uncer¬ 
tainty intervals [see also Bertrand, Duflo and Mullainathan (2002), Hansen 
(2007a, 2007b), Solon (1984)]. Second, most DD analyses only consider two 
time points: before and after the intervention. In practice, the manner in 
which an effect evolves over time, especially its onset and decay structure, 
is often a key question. 

Third, when DD analyses are based on time series, previous studies have 
imposed restrictions on the way in which a synthetic control is constructed 
from a set of predictor variables, which is something we wish to avoid. 
For example, one strategy [Abadie and Gardeazabal (2003), Abadie, Dia¬ 
mond and Hainmueller (2010)] has been to choose a convex combination 
{wi, ..., wj),Wj >0,'^Wj = loiJ predictor time series in such a way that 
a vector of pre-treatment variables (not time series) Xi characterising the 
treated unit before the intervention is matched most closely by the combi¬ 
nation of pre-treatment variables Xq of the control units w.r.t. a vector of 
importance weights {vi,... ,vj). These weights are themselves determined in 
such a way that the combination of pre-treatment outcome time series of the 
control units most closely matches the pre-treatment outcome time series of 
the treated unit. Such a scheme relies on the availability of interpretable 
characteristics (e.g., growth predictors), and it precludes nonconvex combi¬ 
nations of controls when constructing the weight vector W. We prefer to 
select a combination of control series without reference to external charac¬ 
teristics and purely in terms of how well they explain the pre-treatment out¬ 
come time series of the treated unit (while automatically balancing goodness 
of fit and model complexity through the use of regularizing priors). Another 
idea [Belloni et al. (2013)] has been to use classical variable-selection meth¬ 
ods (such as the Lasso) to find a sparse set of predictors. This approach, 
however, ignores posterior uncertainty about both which predictors to use 
and their coefficients. 

The limitations of DD schemes can be addressed by using state-space 
models, coupled with highly flexible regression components, to explain the 
temporal evolution of an observed outcome. State-space models distinguish 
between a state equation that describes the transition of a set of latent vari¬ 
ables from one time point to the next and an observation equation that 
specifies how a given system state translates into measurements. This dis¬ 
tinction makes them extremely flexible and powerful [see Leeflang et al. 
(2009) for a discussion in the context of marketing research]. 
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The approach described in this paper inherits three main characteristics 
from the state-space paradigm. First, it allows us to flexibly accommodate 
different kinds of assumptions about the latent state and emission processes 
underlying the observed data, including local trends and seasonality. Second, 
we use a fully Bayesian approach to inferring the temporal evolution of 
counterfactual activity and incremental impact. One advantage of this is the 
flexibility with which posterior inferences can be summarised. Third, we use 
a regression component that precludes a rigid commitment to a particular set 
of controls by integrating out our posterior uncertainty about the influence 
of each predictor as well as our uncertainty about which predictors to include 
in the hrst place, which avoids overhtting. 

The remainder of this paper is organised as follows. Section 2 describes 
the proposed model, its design variations, the choice of diffuse empirical 
priors on hyperparameters, and a stochastic algorithm for posterior infer¬ 
ence based on Markov chain Monte Carlo (MCMC). Section 3 demonstrates 
important features of the model using simulated data, followed by an ap¬ 
plication in Section 4 to an advertising campaign run by one of Google’s 
advertisers. Section 5 puts our approach into context and discusses its scope 
of application. 

2. Bayesian structural time-series models. Structural time-series models 
are state-space models for time-series data. They can be defined in terms of 
a pair of equations 

( 2 . 1 ) yt = Zjat + et, 

( 2 . 2 ) at+i=Ttat +Rtm^ 

where et ~ A/’(0,cJj) and r]t ~A/’(0, Qt) are independent of all other un¬ 
knowns. Equation (2.1) is the observation equation] it links the observed 
data ut to a latent d-dimensional state vector at- Equation (2.2) is the state 
equation] it governs the evolution of the state vector at through time. In 
the present paper, yt is a scalar observation, Zt is a d-dimensional output 
vector, Tt is a d X d transition matrix, Rt is a d x q control matrix, £t is 
a scalar observation error with noise variance at, and rjt is a g'-dimensional 
system error with a q x q state-diffusion matrix Qt, where q < d. Writing 
the error structure of equation (2.2) as Rtyt allows us to incorporate state 
components of less than full rank; a model for seasonality will be the most 
important example. 

Structural time-series models are useful in practice because they are flex¬ 
ible and modular. They are flexible in the sense that a very large class of 
models, including all ARIMA models, can be written in the state-space form 
given by (2.1) and (2.2). They are modular in the sense that the latent state 
as well as the associated model matrices Zt,Tt,Rt, and Qt can be assembled 
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from a library of component sub-models to capture important features of the 
data. There are several widely used state-component models for capturing 
the trend, seasonality or effects of holidays. 

A common approach is to assume the errors of different state-component 
models to be independent (i.e., Qt is block-diagonal). The vector at can then 
be formed by concatenating the individual state components, while Tt and 
Rt become block-diagonal matrices. 

The most important state component for the applications considered in 
this paper is a regression component that allows us to obtain counterfactual 
predictions by constructing a synthetic control based on a combination of 
markets that were not treated. Observed responses from such markets are 
important because they allow us to explain variance components in the 
treated market that are not readily captured by more generic seasonal sub¬ 
models. 

This approach assumes that covariates are unaffected by the effects of 
treatment. For example, an advertising campaign run in the United States 
might spill over to Canada or the United Kingdom. When assuming the 
absence of spill-over effects, the use of such indirectly affected markets as 
controls would lead to pessimistic inferences, that is, the effect of the cam¬ 
paign would be underestimated [cf. Meyer (1995)]. 

2.1. Components of state. 

Local linear trend. The hrst component of our model is a local linear 
trend, defined by the pair of equations 

pt+i = lJ^t +St + 7]^ t, 

(2.3) 

St+i — St + ris^t, 

where ~ AA(0,c7^) and Tjs^t ~-^(0)T|)- The pit component is the value of 
the trend at time t. The St component is the expected increase in pt between 
times t and t -|- 1, so it can be thought of as the slope at time t. 

The local linear trend model is a popular choice for modelling trends be¬ 
cause it quickly adapts to local variation, which is desirable when making 
short-term predictions. This degree of flexibility may not be desired when 
making longer-term predictions, as such predictions often come with implau¬ 
sibly wide uncertainty intervals. 

There is a generalisation of the local linear trend model where the slope 
exhibits stationarity instead of obeying a random walk. This model can be 
written as 


(2.4) 


pit+i = lJ-t + St + r]fj,^t, 

St+i = D + p{5t — D) + rjg^f, 
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where the two components of r] are independent. In this model, the slope 
of the time trend exhibits AR(1) variation around a long-term slope of D. 
The parameter |/ 9 | < 1 represents the learning rate at which the local trend is 
updated. Thus, the model balances short-term information with information 
from the distant past. 

Seasonality. There are several commonly used state-component models 
to capture seasonality. The most frequently used model in the time domain 
is 


S-2 



(2.5) 


where S represents the number of seasons and yt denotes their joint con¬ 
tribution to the observed response yt- The state in this model consists of 
the S' — 1 most recent seasonal effects, but the error term is a scalar, so 
the evolution equation for this state model is less than full rank. The mean 
of 7 t+i is such that the total seasonal effect is zero when summed over S 
seasons. For example, if we set S = 4 to capture four seasons per year, the 
mean of the winter coefficient will be — 1 x {spring -|- summer -|- autumn). 
The part of the transition matrix Tt representing the seasonal model is an 
S' — 1 X S' — 1 matrix with — I’s along the top row, I’s along the subdiagonal 
and O’s elsewhere. 

The preceding seasonal model can be generalised to allow for multiple 
seasonal components with different periods. When modelling daily data, for 
example, we might wish to allow for an S = 7 day-of-week effect, as well as an 
S = 52 weekly annual cycle. The latter can be handled by setting Tt = / 5 - 1 , 
with zero variance on the error term, when t is not the start of a new week, 
and setting Tt to the usual seasonal transition matrix, with nonzero error 
variance, when t is the start of a new week. 

Contemporaneous covariates with static coefficients. Control time series 
that received no treatment are critical to our method for obtaining accu¬ 
rate counterfactual predictions since they account for variance components 
that are shared by the series, including, in particular, the effects of other 
unobserved causes otherwise unaccounted for by the model. A natural way 
of including control series in the model is through a linear regression. Its 
coefficients can be static or time-varying. 

A static regression can be written in state-space form by setting Zt = 
/3'^xt and at = 1. One advantage of working in a fully Bayesian treatment is 
that we do not need to commit to a hxed set of covariates. The spike-and- 
slab prior described in Section 2.2 allows us to integrate out our posterior 
uncertainty about which covariates to include and how strongly they should 
influence our predictions, which avoids overfitting. 
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All covariates are assumed to be contemporaneous; the present model does 
not infer on a potential lag between treated and untreated time series. A 
known lag, however, can be easily incorporated by shifting the corresponding 
regressor in time. 

Contemporaneous covariates with dynamie coefficients. An alternative to 
the above is a regression component with dynamic regression coefficients to 
account for time-varying relationships [e.g., Banerjee, Kauffman and Wang 
(2007), West and Harrison (1997)]. Given covariates j = 1,..., J, this intro¬ 
duces the dynamic regression component 


J 



( 2 . 6 ) 




where ~-^(0)T^^.)- Here, is the coefficient for the jth control series 
and ap. is the standard deviation of its associated random walk. We can 
write the dynamic regression component in state-space form by setting Zt = 
'X-t and at = ffi and by setting the corresponding part of the transition matrix 
to Tt = Ij^j, with Qt = diag(cr^p. 

Assembling the state-space model. Structural time-series models allow us 
to examine the time series at hand and flexibly choose appropriate compo¬ 
nents for trend, seasonality, and either static or dynamic regression for the 
controls. The presence or absence of seasonality, for example, will usually be 
obvious by inspection. A more subtle question is whether to choose static 
or dynamic regression coefficients. 

When the relationship between controls and treated unit has been sta¬ 
ble in the past, static coefficients are an attractive option. This is because 
a spike-and-slab prior can be implemented efficiently within a forward- 
hltering, backward-sampling framework. This makes it possible to quickly 
identify a sparse set of covariates even from tens or hundreds of potential 
variables [Scott and Varian (2014)]. Local variability in the treated time 
series is captured by the dynamic local level or dynamic linear trend com¬ 
ponent. Covariate stability is typically high when the available covariates 
are close in nature to the treated metric. The empirical analyses presented 
in this paper, for example, will be based on a static regression component 
(Section 4). This choice provides a reasonable compromise between captur¬ 
ing local behaviour and accounting for regression effects. 

An alternative would be to use dynamic regression coefficients, as we 
do, for instance, in our analyses of simulated data (Section 3). Dynamic 
coefficients are useful when the linear relationship between treated metrics 
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Fig. 2. Graphical model for the static-regression variant of the proposed state-space 
model. Observed market activity yi;n = {yi,... ,yn) is modelled as the result of a latent 
state plus Gaussian observation noise with error standard deviation ay. The state at in¬ 
cludes a local level fit, a local linear trend St, and a set of contemporaneous covariates xt, 
scaled by regression coefficients fg. State components are assumed to evolve according to 
independent Gaussian random walks with fixed standard deviations a^ and as (condition¬ 
al-dependence arrows shown for the first time point only). The model includes empirical 
priors on these parameters and the initial states. In an alternative formulation, the re¬ 
gression coefficients fi are themselves subject to random-walk diffusion (see main text). Of 
principal interest is the posterior predictive density over the unobserved counterfactual re¬ 
sponses ijn+i,... ,ym. Subtracting these from the actual observed data yn+i,... ,ym yields 
a probability density over the temporal evolution of causal impact. 


and controls is believed to change over time. There are a number of ways 
of reducing the computational burden of dealing with a potentially large 
number of dynamic coefficients. One option is to resort to dynamic latent 
factors, where one uses xj = But + i^t with dim(uf) ^ J and uses instead 
of Xt as part of Zt in (2.1), coupled with an AR-type model for ut itself. 
Another option is latent thresholding regression, where one uses a dynamic 
version of the spike-and-slab prior as in Nakajima and West (2013). 

The state-component models are assembled independently, with each com¬ 
ponent providing an additive contribution to yt- Figure 2 illustrates this pro¬ 
cess assuming a local linear trend paired with a static regression component. 

2.2. Prior distributions and prior elicitation. Let 9 generically denote 
the set of all model parameters and let a = {ai,... ,am) denote the full 
state sequence. We adopt a Bayesian approach to inference by specifying 
a prior distribution p{9) on the model parameters as well as a distribution 
p(aol^) on the initial state values. We may then sample from p{a,9\y) using 
MCMC. 
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Most of the models in Section 2.1 depend solely on a small set of variance 
parameters that govern the diffusion of the individual state components. A 
typical prior distribution for such a variance is 


(2.7) 





where G{cL,b) is the Gamma distribution with expectation a/b. The prior 
parameters can be interpreted as a prior sum of squares s, so that s/u is 
a prior estimate of ci^, and u is the weight, in units of prior sample size, 
assigned to the prior estimate. 

We often have a weak default prior belief that the incremental errors in 
the state process are small, which we can formalise by choosing small values 
of zz (e.g., 1) and small values of s/u. The notion of “small” means different 
things in different models; for the seasonal and local linear trend models our 
default priors are l/cr^ ~ 10“^Sy), where Sy = YltiVi ~ “ 1) 

is the sample variance of the target series. Scaling by the sample variance 
is a minor violation of the Bayesian paradigm, but it is an effective means 
of choosing a reasonable scale for the prior. It is similar to the popular 
technique of scaling the data prior to analysis, but we prefer to do the 
scaling in the prior so we can model the data on its original scale. 

When faced with many potential controls, we prefer letting the model 
choose an appropriate set. This can be achieved by placing a spike-and- 
slab prior over coefficients [George and McCulloch (1993, 1997), Poison and 
Scott (2011), Scott and Varian (2014)]. A spike-and-slab prior combines 
point mass at zero (the “spike”), for an unknown subset of zero coefficients, 
with a weakly informative distribution on the complementary set of nonzero 
coefficients (the “slab”). Contrary to what its name might suggest, the “slab” 
is usually not completely flat, but rather a Gaussian with a large variance. 
Let Q = (^ 1 ,..., Qj), where Qj = 1 if 0 and Qj = 0 otherwise. Let Pg 
denote the nonzero elements of the vector /3 and let denote the rows 

and columns of corresponding to nonzero entries in g. We can then 
factorise the spike-and-slab prior as 

(2.8) p{g,P,l/a^)=p{g)p{a^\Q)p{l3g\g,af). 

The spike portion of (2.8) can be an arbitrary distribution over {0, l}*^ in 
principle; the most common choice in practice is a product of independent 
Bernoulli distributions. 


(2.9) p{g) = YlTTf{l-TTj)^ 

i=i 

where tTj is the prior probability of regressor j being included in the model. 
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Values for tTj can be elicited by asking about the expected model size M, 
and then setting all nj = M/J. An alternative is to use a more specific set 
of values vr^. In particular, one might choose to set certain nj to either 1 
or 0 to force the corresponding variables into or out of the model. Generally, 
framing the prior in terms of expected model size has the advantage that the 
model can adapt to growing numbers of predictor variables without having 
to switch to a hierarchical prior [Scott and Berger (2010)]. 

For the “slab” portion of the prior we use a conjugate normal-inverse 
Gamma distribution. 


( 2 . 10 ) 

( 2 . 11 ) 




crt 


2 2 


The vector b in equation (2.10) encodes our prior expectation about the 
value of each element of [3. In practice, we usually set b = 0. The prior 
parameters in equation (2.11) can be elicited by asking about the expected 
€ [0,1] as well as the number of observations worth of weight the prior 
estimate should be given. Then — R^)Sy. 

The final prior parameter in (2.10) is which, up to a scaling factor, is 
the prior precision over /3 in the full model, with all variables included. The 
total information in the covariates is X^X, and so ^X"^X is the average in¬ 
formation in a single observation. Zellner’s g-prior [Zellner (1986), Chipman, 
George and McGulloch (2001), Liang et al. (2008)] sets = ^X"^X, so 
that g can be interpreted as g observations worth of information. Zellner’s 
prior becomes improper when X^X is not positive dehnite; we therefore 
ensure propriety by averaging X"^X with its diagonal, 

(2.12) = -{wX'^X + (!-«;) diag(V^X)} 

n 

with default values oi g = 1 and w = 1/2. Overall, this prior specification 
provides a broadly useful default while providing considerable flexibility in 
those cases where more specific prior information is available. 


2.3. Inference. Posterior inference in our model can be broken down into 
three pieces. First, we simulate draws of the model parameters 6 and the 
state vector a given the observed data yi, ^ in the training period. Second, 
we use the posterior simulations to simulate from the posterior predictive dis¬ 
tribution p(yn-i-i:m|yi;n) over the counterfactual time series yn-i-i;m given 
the observed pre-intervention activity yi:n- Third, we use the posterior pre¬ 
dictive samples to compute the posterior distribution of the pointwise impact 
Vt — Vt for each f = 1,..., m. We use the same samples to obtain the posterior 
distribution of cumulative impact. 
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Posterior simulation. We use a Gibbs sampler to simulate a sequence 
{9 , {6, ... from a Markov chain whose stationary distribution is 

p{9,a\yi:n)- The sampler alternates between a data-augmentation step that 
simulates from p[a\yi-n,9) and a parameter-simulation step that simulates 
from p{e\yi.,n,a). 

The data-augmentation step uses the posterior simulation algorithm from 
Durbin and Koopman (2002), providing an improvement over the earlier 
forward-filtering, backward-sampling algorithms by Carter and Kohn (1994), 
Friihwirth-Schnatter (1994), and de Jong and Shephard (1995). In brief, be¬ 
cause p(yi ■,niOt\9) is jointly multivariate normal, the variance of p(a|yi ■.n-,9) 
does not depend on yi, We can therefore simulate (yj^, n) r^j p{yi-.n,a\9) 
and subtract E(a*|y)^.„,0) to obtain zero-mean noise with the correct vari¬ 
ance. Adding E(a|yi:„,0) restores the correct mean, which completes the 
draw. The required expectations can be computed using the Kalman fil¬ 
ter and a fast mean smoother described in detail by Durbin and Koopman 
(2002). The result is a direct simulation from p{oi\yi-m9) in an algorithm 
that is linear in the total (pre- and post-intervention) number of time points 
(m) and quadratic in the dimension of the state space {d). 

Given the draw of the state, the parameter draw is straightforward for all 
state components other than the static regression coefficients /3. All state 
components that exclusively depend on variance parameters can translate 
their draws back to error terms rg and accumulate sums of squares of ry, 
and, because of conjugacy with equation (2.7), the posterior distribution 
will remain Gamma distributed. 

The draw of the static regression coefficients /3 proceeds as follows. Eor 
each t = l,...,n in the pre-intervention period, let yt denote yt with the 
contributions from the other state components subtracted away, and let 
yi : n = (yi ) • • • j 2/n)• The challenge is to simulate from p { q , I3,a‘^\yi:n), which 
we can factor into p{Q\yi:n)p{^/o'‘^\0,yi:n)pif3\g,o'e,yi:n)- Because of con¬ 
jugacy, we can integrate out /3 and l/fJ^ and be left with 

^o1o^ I- pis) 

(2.13) g|yi:n-C(yi:n) ^^_i^,^, ’ 

where C{yi:n) is an unknown normalizing constant. The sufficient statistics 
in equation (2.13) are 

K-i = 4 = (K-i)-'(Xjyi, „ + 

N = Ve + n, Sg = Se + y?’, nYl: n + 

To sample from (2.13), we use a Gibbs sampler that draws each Qj given 
all other q-j. Each full-conditional is easy to evaluate because Qj can only 
assume two possible values. It should be noted that the dimension of all 
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matrices in (2.13) is qj, which is small if the model is truly sparse. There 
are many matrices to manipulate, but because each is small, the overall 
algorithm is fast. Once the draw of g is complete, we sample directly from 
p{/3,1/a‘l\Q,yi-n) using standard conjugate formulae. For an alternative that 
may be even more computationally efficient, see Ghosh and Clyde (2011). 

Posterior predictive simulation. While the posterior over model parame¬ 
ters and states p{9, Q:|yi: n) can be of interest in its own right, causal impact 
analyses are primarily concerned with the posterior incremental effect, 

(2.14) P{yn+l:m\yi:n,^l-.m)- 

As shown by its indices, the density in equation (2.14) is defined precisely 
for that portion of the time series which is unobserved: the counterfactual 
market response ijn+i, ■ ■ ■ -iVm that would have been observed in the treated 
market, after the intervention, in the absence of treatment. 

It is also worth emphasising that the density is conditional on the ob¬ 
served data (as well as the priors) and only on these, that is, on activity 
in the treatment market before the beginning of the intervention as well as 
activity in all control markets both before and during the intervention. The 
density is not conditioned on parameter estimates or the inclusion or exclu¬ 
sion of covariates with static regression coefficients, all of which have been 
integrated out. Thus, through Bayesian model averaging, we commit neither 
to any particular set of covariates, which helps avoid an arbitrary selection, 
nor to point estimates of their coefficients, which prevents overfitting. 

The posterior predictive density in (2.14) is defined as a coherent (joint) 
distribution over all counterfactual data points, rather than as a collection of 
pointwise univariate distributions. This ensures that we correctly propagate 
the serial structure determined on pre-intervention data to the trajectory 
of counterfactuals. This is crucial, in particular, when forming summary 
statistics, such as the cumulative effect of the intervention on the treatment 
market. 

Posterior inference was implemented in C-|—I- with an R interface. Given 
a typically-sized data set with m = 500 time points, J = 10 covariates, and 
10,000 iterations (see Section 4 for an example), this implementation takes 
less than 30 seconds to complete on a standard computer, enabling near¬ 
interactive analyses. 

2.4. Evaluating impact. Samples from the posterior predictive distribu¬ 
tion over counterfactual activity can be readily used to obtain samples from 
the posterior causal effect, that is, the quantity we are typically interested 
in. For each draw r and for each time point t = n -t- 1,..., m, we set 


(2.15) 
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yielding samples from the approximate posterior predictive density of the 
effect attributed to the intervention. 

In addition to its pointwise impact, we often wish to understand the 
cumulative effect of an intervention over time. One of the main advantages 
of a sampling approach to posterior inference is the flexibility and ease with 
which such derived inferences can be obtained. Reusing the impact samples 
obtained in (2.15), we compute for each draw r 

t 

(2.16) ^ Vt = re + 1,... ,m. 

t'=n+l 

The preceding cumulative sum of causal increments is a useful quantity 
when y represents a flow quantity, measured over an interval of time (e.g., 
a day), such as the number of searches, sign-ups, sales, additional installs or 
new users. It becomes uninterpretable when y represents a stock quantity, 
usefully defined only for a point in time, such as the total number of clients, 
users or subscribers. In this case we might instead choose, for each r, to draw 
a sample of the posterior running average effect following the intervention, 

(2.17) - Vt = re + 1,... ,rre. 

t — re 

£'= 71+1 

Unlike the cumulative effect in (2.16), the running average is always in¬ 
terpretable, regardless of whether it refers to a flow or a stock. However, 
it is more context-dependent on the length of the post-intervention period 
under consideration. In particular, under the assumption of a true impact 
that grows quickly at first and then declines to zero, the cumulative impact 
approaches its true total value (in expectation) as we increase the coun- 
terfactual forecasting period, whereas the average impact will eventually 
approach zero (while, in contrast, the probability intervals diverge in both 
cases, leading to more and more uncertain inferences as the forecasting pe¬ 
riod increases). 

3. Application to simulated data. To study the characteristics of our 
approach, we analysed simulated (i.e., computer-generated) data across a 
series of independent simulations. Generated time series started on 1 January 
2013 and ended on 30 June 2014, with a perturbation beginning on 1 January 
2014. The data were simulated using a dynamic regression component with 
two covariates whose coefficients evolved according to independent random 
walks, flt ~-'^(/dt-i; 0.01^), initialised at /Jq = 1- The covariates themselves 
were simple sinusoids with wavelengths of 90 days and 360 days, respectively. 
The latent state underlying the observed data was generated using a local 
level that evolved according to a random walk, yt ~ A7(/it_i, 0.1^), initialised 
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(a) 


(b) 


(c) 
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Fig. 3. Adequacy of posterior uncertainty, (a) Example of one of the 256 data sets created 
to assess estimation accuracy. Simulated observations (black) are based on two contempo¬ 
raneous covariates, scaled by time-varying coefficients plus a time-varying local level (not 
shown). During the campaign period, where the data are lifted by an effect size of 10%, 
the plot shows the posterior expectation of counterfactual activity (blue), along with its 
pointwise central 95% credible intervals (blue shaded area), and, for comparison, the true 
counterfactual (green), (b) Power curve. Following repeated application of the model to 
simulated data, the plot shows the empirical frequency of concluding that a causal effect 
was present, as a function of true effect size, given a post-intervention period of 6 months. 
The curve represents sensitivity in those parts of the graph where the true effect size is 
positive, and 1—specificity where the true effect size is zero. Error bars represent 95% 
credible intervals for the true sensitivity, using a uniform Beta(l,l) prior, (c) Interval 
coverage. Using an effect size of 10%, the plot shows the proportion of simulations in 
whieh the pointwise central 95% credible interval contained the true impact, as a function 
of campaign duration. Intervals should contain ground truth in 95% of simulations, how¬ 
ever much uncertainty its predictions may be associated with. Error bars represent 95% 
credible intervals. 


at fiQ = 0. Independent observation noise was sampled using £t ~ A^(0,0.1^). 
In summary, observations yt were generated using 

yt = I3t,izt,i + (3t,2Zt,2 + Pt + £t- 

To simulate the effect of advertising, the post-intervention portion of the 
preceding series was multiplied by 1 -|- e, where e (not to be confused with e) 
represented the true effect size specifying the (uniform) relative lift during 
the campaign period. An example is shown in Figure 3(a). 

Sensitivity and specificity. To study the properties of our model, we be¬ 
gan by considering under what circumstances we successfully detected a 
causal effect, that is, the statistical power or sensitivity of our approach. 
A related property is the probability of not detecting an absent impact, 
that is, specificity. We repeatedly generated data, as described above, under 
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different true effect sizes. We then computed the posterior predictive distri¬ 
bution over the counterfactuals and recorded whether or not we would have 
concluded a causal effect. 

For each of the effect sizes 0%, 0.1%, 1%, 10% and 100%, a total of 
2® = 256 simulations were run. This number was chosen simply on the 
grounds that it provided reasonably tight intervals around the reported sum¬ 
mary statistics without requiring excessive amounts of computation. In each 
simulation, we concluded that a causal effect was present if and only if the 
central 95% posterior probability interval of the cumulative effect excluded 
zero. 

The model used throughout this section comprised two structural blocks. 
The first one was a local level component. We placed an inverse-Gamma 
prior on its diffusion variance with a prior estimate of sjv = O.lay and a 
prior sample size v = 2>2. The second structural block was a dynamic regres¬ 
sion component. We placed a Gamma prior with prior expectation Q.lay on 
the diffusion variance of both regression coefficients. By construction, the 
outcome variable did not exhibit any local trends or seasonality other than 
the variation conveyed through the covariates. This obviated the need to 
include an explicit local linear trend or seasonality component in the model. 

In a first analysis, we considered the empirical proportion of simulations 
in which a causal effect had been detected. When taking into account only 
those simulations where the true effect size was greater than zero, these 
empirical proportions provide estimates of the sensitivity of the model w.r.t. 
the process by which the data were generated. Gonversely, those simulations 
where the campaign had no effect yield an estimate of the model’s specificity. 
In this way, we obtained the power curve shown in Figure 3(b). The curve 
shows that, in data such as these, a market perturbation leading to a lift no 
larger than 1% is missed in about 90% of cases. By contrast, a perturbation 
that lifts market activity by 25% is correctly detected as such in most cases. 

In a second analysis, we assessed the coverage properties of the poste¬ 
rior probability intervals obtained through our model. It is desirable to use 
a diffuse prior on the local level component such that central 95% inter¬ 
vals contain ground truth in about 95% of the simulations. This coverage 
frequency should hold regardless of the length of the campaign period. In 
other words, a longer campaign should lead to posterior intervals that are 
appropriately widened to retain the same coverage probability as the nar¬ 
rower intervals obtained for shorter campaigns. This was approximately the 
case throughout the simulated campaign [Figure 3(c)]. 

Estimation accuracy. To study the accuracy of the point estimates sup¬ 
ported by our approach, we repeated the preceding simulations with a fixed 
effect size of 10% while varying the length of the campaign. When given a 
quadratic loss function, the loss-minimizing point estimate is the posterior 
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(a) 


(b) 



Fig. 4. Estimation accuracy, (a) Time series of absolute percentage discrepancy between 
inferred effect and true effect. The plot shows the rate (meani: 2 s.e.m.) at which predic¬ 
tions become less accurate as the length of the counterfaetual forecasting period increases 
(blue). The well-behaved decrease in estimation accuracy breaks down when the data are 
subject to a sudden structural change (red), as simulated for 1 April 2014. (b) Illustration 
of a structural break. The plot shows one example of the time series underlying the red 
curve m (a) . On 1 April 2014, the standard deviation of the generating random walk of 
the local level was tripled, causing the rapid decline in estimation accuracy seen in the red 
curve m (a). 


expectation of the predictive density over counterfactuals. Thus, for each 
generated data set i, we computed the expected causal effect for each time 
point, 

(3.1) {(t)t\yi,---,ym,xi,...,Xm) Vf = n +l,...,m;i = 1,...,256. 


To quantify the discrepancy between estimated and true impact, we calcu¬ 
lated the absolute percentage estimation error. 


(3.2) 


\^i,t 4tt\ 



This yielded an empirical distribution of absolute percentage estimation er¬ 
rors [Figure 4(a), blue], showing that impact estimates become less and less 
accurate as the forecasting period increases. This is because, under the local 
linear trend model in (2.3), the true counterfaetual activity becomes more 
and more likely to deviate from its expected trajectory. 

It is worth emphasising that all preceding results are based on the as¬ 
sumption that the model structure remains intact throughout the modelling 
period. In other words, even though the model is built around the idea of 
multiple (nonstationary) components (i.e., a time-varying local trend and, 
potentially, time-varying regression coefficients), this structure itself remains 
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unchanged. If the model structure does change, estimation accuracy may 
suffer. 

We studied the impact of a changing model structure in a second simula¬ 
tion in which we repeated the procedure above in such a way that 90 days 
after the beginning of the campaign the standard deviation of the random 
walk governing the evolution of the regression coefficient was tripled (now 
0.03 instead of 0.01). As a result, the observed data began to diverge much 
more quickly than before. Accordingly, estimations became considerably less 
reliable [Figure 4(a), red]. An example of the underlying data is shown in 
Figure 4(b). 

The preceding simulations highlight the importance of a model that is 
sufficiently flexible to account for phenomena typically encountered in sea¬ 
sonal empirical data. This rules out entirely static models in particular (such 
as multiple linear regression). 

4. Application to empirical data. To illustrate the practical utility of 
our approach, we analysed an advertising campaign run by one of Google’s 
advertisers in the United States. In particular, we inferred the campaign’s 
causal effect on the number of times a user was directed to the advertiser’s 
website from the Google search results page. We provide a brief overview 
of the underlying data below [see Vaver and Koehler (2011) for additional 
details]. 

The campaign analysed here was based on product-related ads to be dis¬ 
played alongside Google’s search results for specific keywords. Ads went live 
for a period of 6 consecutive weeks and were geo-targeted to a randomised 
set of 95 out of 190 designated market areas (DMAs). The most salient 
observable characteristic of DMAs is offline sales. To produce balance in 
this characteristic, DMAs were first rank-ordered by sales volume. Pairs of 
regions were then randomly assigned to treatment/control. DMAs provide 
units that can be easily supplied with distinct offerings, although this fine¬ 
grained split was not a requirement for the model. In fact, we carried out 
the analysis as if only one treatment region had been available (formed by 
summing all treated DMAs). This allowed us to evaluate whether our ap¬ 
proach would yield the same results as more conventional treatment-control 
comparisons would have done. 

The outcome variable analysed here was search-related visits to the ad¬ 
vertiser’s website, consisting of organic clicks (i.e., clicks on a search result) 
and paid clicks (i.e., clicks on an ad next to the search results, for which the 
advertiser was charged). Since paid clicks were zero before the campaign, 
one might wonder why we could not simply count the number of paid clicks 
after the campaign had started. The reason is that paid clicks tend to can¬ 
nibalise some organic clicks. Since we were interested in the net effect, we 
worked with the total number of clicks. 
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The first building block of the model used for the analyses in this section 
was a local level component. For the inverse-Gamma prior on its diffusion 
variance we used a prior estimate of sjv = Q.lay and a prior sample size 
V = 32. The second structural block was a static regression component. We 
used a spike-and-slab prior with an expected model size of M = 3, an ex¬ 
pected explained variance of B? = 0.8 and 50 prior df. We deliberately kept 
the model as simple as this. Since the covariates came from a randomised 
experiment, we expected them to already account for any additional local 
linear trends and seasonal variation in the response variable. If one suspects 
that a more complex model might be more appropriate, one could opti¬ 
mise model design through Bayesian model selection. Here, we focus instead 
on comparing different sets of covariates, which is critical in counterfactual 
analyses regardless of the particular model structure used. Model estimation 
was carried out using 10,000 MCMC samples. 

Analysis 1: Effect on the treated, using a randomised control. We began 
by applying the above model to infer the causal effect of the campaign on 
the time series of clicks in the treated regions. Given that a set of unaffected 
regions was available in this analysis, the best possible set of controls was 
given by the untreated DMAs themselves (see below for a comparison with 
a purely observational alternative). 

As shown in Figure 5(a), the model provided an excellent fit on the pre¬ 
campaign trajectory of clicks (including a spike in “week —2” and a dip at the 
end of “week —1”). Following the onset of the campaign, observations quickly 
began to diverge from counterfactual predictions; the actual number of clicks 
was consistently higher than what would have been expected in the absence 
of the campaign. The curves did not reconvene until one week after the end 
of the campaign. Subtracting observed from predicted data, as we did in 
Figure 5(b), resulted in a posterior estimate of the incremental lift caused 
by the campaign. It peaked after about three weeks into the campaign, and 
faded away after about one week after the end of the campaign. Thus, as 
shown in Figure 5(c), the campaign led to a sustained cumulative increase in 
total clicks (as opposed to a mere shift of future clicks into the present or a 
pure cannibalization of organic clicks by paid clicks). Specifically, the overall 
effect amounted to 88,400 additional clicks in the targeted regions (posterior 
expectation; rounded to three significant digits), that is, an increase of 22%, 
with a central 95% credible interval of [13%, 30%]. 

To validate this estimate, we returned to the original experimental data, 
on which a conventional treatment-control comparison had been carried out 
using a two-stage linear model [Vaver and Koehler (2011)]. This analysis 
had led to an estimated lift of 84,700 clicks, with a 95% confidence interval 
for the relative expected lift of [19%, 22%]. Thus, with a deviation of less 
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(b) 



(c) 



Fig. 5. Causal effect of online advertising on clicks in treated regions, (a) Time series 
of search-related visits to the advertiser’s website (including both organic and paid clicks). 
(b) Pointwise (daily) incremental impact of the campaign on clicks. Shaded vertical bars 
indicate weekends, (c) Cumulative impact of the campaign on clicks. 


than 5%, the counterfactual approach had led to almost precisely the same 
estimate as the randomised evaluation, except for its wider intervals. The 
latter is expected, given that our intervals represent prediction intervals, 
not confidence intervals. Moreover, in addition to an interval for the sum 
over all time points, our approach yields a full time series of pointwise in¬ 
tervals, which allows analysts to examine the characteristics of the temporal 
evolution of attributable impact. 

The posterior predictive intervals in Figure 5(b) widen more slowly than 
in the illustrative example in Figure 1. This is because the large number of 
controls available in this data set offers a much higher pre-campaign predic¬ 
tive strength than in the simulated data in Figure 1. This is not unexpected, 
given that controls came from a randomised experiment, and we will see that 
this also holds for a subsequent analysis (see below) that is based on yet an- 
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other data source for predictors. A consequence of this is that there is little 
variation left to be captured by the random-walk component of the model. 
A reassuring hnding is that the estimated counterfactual time series in Fig¬ 
ure 5(a) eventually almost exactly rejoins the observed series, only a few 
days after the end of the intervention. 

Analysis 2: Effect on the treated, using observational controls. An impor¬ 
tant characteristic of counterfactual-forecasting approaches is that they do 
not require a setting in which a set of controls, selected at random, was ex¬ 
empt from the campaign. We therefore repeated the preceding analysis in the 
following way: we discarded the data from all control regions and, instead, 
used searches for keywords related to the advertiser’s industry, grouped into 
a handful of verticals, as covariates. In the absence of a dedicated set of 
control regions, such industry-related time series can be very powerful con¬ 
trols, as they capture not only seasonal variations but also market-specihc 
trends and events (though not necessarily advertiser-specific trends). A ma¬ 
jor strength of the controls chosen here is that time series on web searches 
are publicly available through Google Trends (http://www.google.com/ 
trends/). This makes the approach applicable to virtually any kind of in¬ 
tervention. At the same time, the industry as a whole is unlikely to be moved 
by a single actor’s activities. This precludes a positive bias in estimating the 
effect of the campaign that would arise if a covariate was negatively affected 
by the campaign. 

As shown in Figure 6, we found a cumulative lift of 85,900 clicks (pos¬ 
terior expectation), or 21%, with a [12%,30%] interval. In other words, the 
analysis replicated almost perfectly the original analysis that had access to 
a randomised set of controls. One feature in the response variable which 
this second analysis failed to account for was a spike in clicks in the second 
week before the campaign onset; this spike appeared both in treated and 
untreated regions and appears to be specific to this advertiser. In addition, 
the series of point-wise impact [Figure 6(b)] is slightly more volatile than 
in the original analysis (Figure 5). On the other hand, the overall point 
estimate of 85,900, in this case, was even closer to the randomised-design 
baseline (84,700; deviation ca. 1%) than in our first analysis (88,400; devia¬ 
tion ca. 4%). In summary, the counterfactual approach effectively obviated 
the need for the original randomised experiment. Using purely observational 
variables led to the same substantive conclusions. 

Analysis 3: Absence of an effect on the controls. To go one step further 
still, we analysed clicks in those regions that had been exempt from the 
advertising campaign. If the effect of the campaign was truly specific to 
treated regions, there should be no effect in the controls. To test this, we 
inferred the causal effect of the campaign on unaffected regions, which should 
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(b) 



(c) 



Fig. 6. Causal effect of online advertising on clicks, using only searches for keywords 
related to the advertiser’s industry as controls, discarding the original control regions as 
would be the case in studies where a randomised experiment was not carried out. (a) Time 
series of clicks on to the advertiser’s website, (b) Pointwise (daily) incremental impact 
of the campaign on clicks, (c) Cumulative impact of the campaign on clicks. The plots 
show that this analysis, which was based on observational covariates only, provided al¬ 
most exactly the same inferences as the first analysis (Figure 5) that had been based on a 
randomised design. 


not lead to a significant finding. In analogy with our second analysis, we 
discarded clicks in the treated regions and used searches for keywords related 
to the advertiser’s industry as controls. 

As summarised in Figure 7, no significant effect was found in unaffected 
regions, as expected. Specifically, we obtained an overall nonsignificant lift 
of 2% in clicks with a central 95% credible interval of [—6%, 10%]. 

In summary, the empirical data considered in this section showed: (i) a 
clear effect of advertising on treated regions when using randomised control 
regions to form the regression component, replicating previous treatment- 
control comparisons (Figure 5); (ii) notably, an equivalent finding when dis- 
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(b) 



(c) 



Fig. 7. Causal effect of online advertising on clicks in nontreated regions, which should 
not show an effect. Searches for keywords related to the advertiser’s industry are used as 
controls. Plots show inferences in analogy with Figure 5. (a) Time series of clicks to the 
advertiser’s website, (b) Pointwise (daily) incremental impact of the campaign on clicks. 
(c) Cumulative impact of the campaign on clicks. 


carding control regions and instead using observational searches for keywords 
related to the advertiser’s industry as covariates (Figure 6); (iii) reassuringly, 
the absence of an effect of advertising on regions that were not targeted (Fig¬ 
ure 7). 

5. Discussion. The increasing interest in evaluating the incremental im¬ 
pact of market interventions has been reflected by a growing literature on 
applied causal inference. With the present paper we are hoping to contribute 
to this literature by proposing a Bayesian state-space model for obtaining a 
counterfactual prediction of market activity. We discuss the main features 
of this model below. 
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In contrast to most previous schemes, the approach described here is fully 
Bayesian, with regularizing or empirical priors for all hyperparameters. Pos¬ 
terior inference gives rise to complete-data (smoothing) predictions that are 
only conditioned on past data in the treatment market and both past and 
present data in the control markets. Thus, our model embraces a dynamic 
evolution of states and, optionally, coefficients (departing from classical lin¬ 
ear regression models with a fixed number of static regressors), and enables 
us to flexibly summarise posterior inferences. 

Because closed-form posteriors for our model do not exist, we suggest a 
stochastic approximation to inference using MCMC. One convenient conse¬ 
quence of this is that we can reuse the samples from the posterior to obtain 
credible intervals for all summary statistics of interest. Such statistics in¬ 
clude, for example, the average absolute and relative effect caused by the 
intervention as well as its cumulative effect. 

Posterior inference was implemented in C-|—|- and R and, for all empirical 
data sets presented in Section 4, took less than 30 seconds on a standard 
Linux machine. If the computational burden of sampling-based inference 
ever became prohibitive, one option would be to replace it by a variational 
Bayesian approximation [see, e.g., Mathys et al. (2011), Brodersen et al. 
(2013)]. 

Another way of using the proposed model is for power analyses. In par¬ 
ticular, given past time series of market activity, we can define a point in 
the past to represent a hypothetical intervention and apply the model in 
the usual fashion. As a result, we obtain a measure of uncertainty about the 
response in the treated market after the beginning of the hypothetical in¬ 
tervention. This provides an estimate of what incremental effect would have 
been required to be outside of the 95% central interval of what would have 
happened in the absence of treatment. 

The model presented here subsumes several simpler models which, in con¬ 
sequence, lack important characteristics, but which may serve as alternatives 
should the full model appear too complex for the data at hand. One exam¬ 
ple is classical multiple linear regression. In principle, classical regression 
models go beyond difference-in-differences schemes in that they account for 
the full counterfactual trajectory. However, they are not suited for predict¬ 
ing stochastic processes beyond a few steps. This is because ordinary least- 
squares estimators disregard serial autocorrelation; the static model struc¬ 
ture does not allow for temporal variation in the coefficients; and predictions 
ignore our posterior uncertainty about the parameters. Put differently: clas¬ 
sical multiple linear regression is a special case of the state-space model 
described here in which (i) the Gaussian random walk of the local level has 
zero variance; (ii) there is no local linear trend; (iii) regression coefficients 
are static rather than time-varying; (iv) ordinary least squares estimators 
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are used which disregard posterior uncertainty about the parameters and 
may easily overfit the data. 

Another special case of the counterfactual approach discussed in this pa¬ 
per is given by synthetic control estimators that are restricted to the class 
of convex combinations of predictor variables and do not include time-series 
effects such as trends and seasonality [Abadie, Diamond and Hainmueller 
(2010), Abadie (2005)]. Relaxing this restriction means we can utilise pre¬ 
dictors regardless of their scale, even if they are negatively correlated with 
the outcome series of the treated unit. 

Other special cases include autoregressive (AR) and moving-average (MA) 
models. These models define autocorrelation among observations rather than 
latent states, thus precluding the ability to distinguish between state noise 
and observation noise [Ataman, Mela and Van Heerde (2008), Leeflang et al. 
(2009)]. 

In the scenarios we consider, advertising is a planned perturbation of the 
market. This generally makes it easier to obtain plausible causal inferences 
than in genuinely observational studies in which the experimenter had no 
control about treatment [see discussions in Antonakis et al. (2010), Berndt 
(1991), Brady (2002), Camillo and d’Attoma (2010), Hitchcock (2004), Klein- 
berg and Hripcsak (2011), Lewis and Reiley (2011), Lewis, Rao and Reiley 
(2011), Robinson, McNulty and Krasno (2009), Vaver and Koehler (2011), 
Winship and Morgan (1999)]. The principal problem in observational studies 
is endogeneity: the possibility that the observed outcome might not be the 
result of the treatment but of other omitted, endogenous variables. In princi¬ 
ple, propensity scores can be used to correct for the selection bias that arises 
when the treatment effect is correlated with the likelihood of being treated 
[Rubin and Waterman (2006), Chan et al. (2010)]. However, the propensity- 
score approach requires that exposure can be measured at the individual 
level, and it, too, does not guarantee valid inferences, for example, in the 
presence of a specific type of selection bias recently termed “activity bias” 
[Lewis, Rao and Reiley (2011)]. Counterfactual modelling approaches avoid 
these issues when it can be assumed that the treatment market was chosen 
at random. 

Overall, we expect inferences on the causal impact of designed market in¬ 
terventions to play an increasingly prominent role in providing quantitative 
accounts of return on investment [Danaher and Rust (1996), Seggie, Cavusgil 
and Phelan (2007), Leeflang et al. (2009), Stewart (2009)]. This is because 
marketing resources, specifically, can only be allocated to whichever cam¬ 
paign elements jointly provide the greatest return on ad spend (ROAS) if 
we understand the causal effects of spend on sales, product adoption or user 
engagement. At the same time, our approach could be used for many other 
applications involving causal inference. Examples include problems found in 
economics, epidemiology, biology or the political and social sciences. With 
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the release of the Causalimpact R package we hope to provide a simple 
framework serving all of these areas. Structural time-series models are being 
used in an increasing number of applications at Google, and we anticipate 
that they will prove equally useful in many analysis efforts elsewhere. 
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